Data Acquisition
Causal-BALD: Deep Bayesian Active Learning of Outcomes to Infer Treatment-Effects from Observational Data
Estimating personalized treatment effects from high-dimensional observational data is essential in situations where experimental designs are infeasible, unethical, or expensive. Existing approaches rely on fitting deep models on outcomes observed for treated and control populations. However, when measuring individual outcomes is costly, as is the case with a tumor biopsy, a sample-efficient strategy for acquiring each result is required. Deep Bayesian active learning provides a framework for efficient data acquisition by selecting points with high uncertainty. However, existing methods bias training-data acquisition towards regions of non-overlapping support between the treated and control populations. Such acquisitions are not sample-efficient because the treatment effect is not identifiable in these regions. We introduce causal, Bayesian acquisition functions grounded in information theory that bias data acquisition towards regions with overlapping support to maximize sample efficiency for learning personalized treatment effects. We demonstrate the performance of the proposed acquisition strategies on the synthetic and semi-synthetic datasets IHDP and CMNIST and their extensions, which aim to simulate common dataset biases and pathologies.
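To make the acquisition idea concrete, here is a minimal numpy sketch in the spirit of the paper's causal acquisition functions: epistemic uncertainty about the treatment effect (disagreement across posterior draws) is down-weighted where the estimated propensity indicates poor overlap. The exact Causal-BALD objectives are information-theoretic and differ in detail; all names below are illustrative.

```python
import numpy as np

def cate_acquisition_scores(mu0_draws, mu1_draws, propensity, eps=1e-6):
    """Score unlabelled candidates for outcome acquisition when learning CATE.

    mu0_draws, mu1_draws: (S, N) posterior draws of the control / treated
        outcome surfaces (e.g. from MC dropout or a deep ensemble).
    propensity: (N,) estimated P(T=1 | x), used as an overlap proxy.
    """
    tau_draws = mu1_draws - mu0_draws          # draws of the treatment effect
    epistemic = tau_draws.var(axis=0)          # disagreement across draws
    overlap = propensity * (1.0 - propensity)  # peaks where support overlaps
    return epistemic * (overlap + eps)         # uncertain AND identifiable

# Toy usage: 100 candidates, 20 posterior draws each.
rng = np.random.default_rng(0)
mu0 = rng.normal(size=(20, 100))
mu1 = rng.normal(size=(20, 100))
p = rng.uniform(0.01, 0.99, size=100)
scores = cate_acquisition_scores(mu0, mu1, p)
print("next points to label:", np.argsort(-scores)[:10])
```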
Approximation-Aware Bayesian Optimization
High-dimensional Bayesian optimization (BO) tasks such as molecular design often require $>10{,}000$ function evaluations before obtaining meaningful results. While methods like sparse variational Gaussian processes (SVGPs) reduce computational requirements in these settings, the underlying approximations result in suboptimal data acquisitions that slow the progress of optimization. In this paper we modify SVGPs to better align with the goals of BO: targeting informed data acquisition over global posterior fidelity. Using the framework of utility-calibrated variational inference (Lacoste-Julien et al., 2011), we unify GP approximation and data acquisition into a joint optimization problem, thereby ensuring optimal decisions under a limited computational budget. Our approach can be used with any decision-theoretic acquisition function and is readily compatible with trust region methods like TuRBO (Eriksson et al., 2019). We derive efficient joint objectives for the expected improvement (EI) and knowledge gradient (KG) acquisition functions in both the standard and batch BO settings. On a variety of recent high dimensional benchmark tasks in control and molecular design, our approach significantly outperforms standard SVGPs and is capable of achieving comparable rewards with up to $10\times$ fewer function evaluations.
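For reference, the decision-theoretic utilities the paper calibrates against have simple closed forms under a Gaussian posterior. The sketch below computes standard expected improvement from a (possibly approximate) GP posterior; the paper's contribution is to fold such a utility into the SVGP training objective itself rather than applying it after a fidelity-driven fit, which this sketch does not attempt.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_f):
    """Closed-form EI for maximisation under a Gaussian posterior.

    mu, sigma: posterior mean and standard deviation at candidate points.
    best_f: incumbent best observed objective value.
    """
    sigma = np.maximum(sigma, 1e-12)          # guard against zero variance
    z = (mu - best_f) / sigma
    return (mu - best_f) * norm.cdf(z) + sigma * norm.pdf(z)

mu = np.array([0.2, 0.5, 0.1])
sigma = np.array([0.30, 0.05, 0.40])
# The second candidate has the highest EI: its mean already beats the incumbent.
print(expected_improvement(mu, sigma, best_f=0.4))
```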
Control Your Robot: A Unified System for Robot Control and Policy Deployment
Nian, Tian, Ke, Weijie, Zhu, Shaolong, Hu, Bingshan
Cross-platform robot control remains difficult because hardware interfaces, data formats, and control paradigms vary widely, which fragments toolchains and slows deployment. To address this, we present Control Your Robot, a modular, general-purpose framework that unifies data collection and policy deployment across diverse platforms. The system reduces fragmentation through a standardized workflow with modular design, unified APIs, and a closed-loop architecture. It supports flexible robot registration, dual-mode control with teleoperation and trajectory playback, and seamless integration from multimodal data acquisition to inference. Experiments on single-arm and dual-arm systems show efficient, low-latency data collection and effective support for policy learning with imitation learning and vision-language-action models. Policies trained on data gathered by Control Your Robot match expert demonstrations closely, indicating that the framework enables scalable and reproducible robot learning across platforms.
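As an illustration of what a unified, registry-based control interface of this kind can look like, here is a small Python sketch. All class and function names are hypothetical, not the actual Control Your Robot API.

```python
from typing import Callable, Dict

_ROBOT_REGISTRY: Dict[str, Callable[[], "RobotInterface"]] = {}

class RobotInterface:
    """Minimal unified interface that every platform adapter implements."""
    def get_observation(self) -> dict: ...
    def send_action(self, action: dict) -> None: ...

def register_robot(name: str):
    """Decorator enabling flexible robot registration under a string key."""
    def decorator(factory):
        _ROBOT_REGISTRY[name] = factory
        return factory
    return decorator

@register_robot("dummy_single_arm")
class DummySingleArm(RobotInterface):
    def get_observation(self) -> dict:
        return {"joint_pos": [0.0] * 6, "rgb": None}
    def send_action(self, action: dict) -> None:
        print("executing", action)

def make_robot(name: str) -> RobotInterface:
    return _ROBOT_REGISTRY[name]()  # callers only ever see the interface

robot = make_robot("dummy_single_arm")
obs = robot.get_observation()
robot.send_action({"joint_delta": [0.01] * 6})
```

A downstream policy then consumes only the standardized observation and action dictionaries, which is what makes data collection and deployment uniform across platforms.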
CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows
Kim, Hyeonjae, Li, Chenyue, Deng, Wen, Jin, Mengxi, Huang, Wen, Lu, Mengqian, Yuan, Binhang
Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate-specific context and flexibility, and thus perform poorly in practice. We present ClimateAgent, an autonomous multi-agent framework that orchestrates end-to-end climate data analytic workflows. ClimateAgent decomposes user questions into executable sub-tasks coordinated by an Orchestrate-Agent and a Plan-Agent; acquires data via specialized Data-Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding-Agent that generates Python code, visualizations, and a final report with a built-in self-correction loop. To enable systematic evaluation, we introduce Climate-Agent-Bench-85, a benchmark of 85 real-world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate-Agent-Bench-85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub-Copilot (6.27) and a GPT-5 baseline (3.26). These results demonstrate that our multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation for climate science analytic tasks.
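The orchestration loop described above can be sketched schematically as follows. Every class and method name here is a hypothetical stand-in, since the abstract does not specify internals; the sketch only shows the decompose / acquire / code-with-self-correction control flow.

```python
def run_climate_pipeline(question, planner, data_agents, coder, max_retries=3):
    """Schematic multi-agent workflow: plan, acquire data, code, self-correct."""
    subtasks = planner.decompose(question)          # Plan-Agent decomposition
    artifacts = {}
    for task in subtasks:
        if task.kind == "acquire":
            # A Data-Agent introspects the target API and writes a download script.
            artifacts[task.id] = data_agents[task.source].download(task)
        else:
            for _ in range(max_retries):            # built-in self-correction loop
                code = coder.write(task, artifacts)
                ok, err = coder.execute(code)
                if ok:
                    break
                task = task.with_feedback(err)      # feed the error back in
    return coder.report(artifacts)                  # visualizations + final report
```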
How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets
We introduce and analyse active learning markets as a way to purchase labels in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This contrasts with the many existing proposals for purchasing features and examples. By formalising market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer, multiple-seller setup and propose two active learning strategies (variance-based and query-by-committee-based), paired with distinct pricing mechanisms, and compare them to a benchmark random sampling approach. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, which consistently achieves superior performance with fewer labels acquired than conventional methods. Our proposal constitutes an easy-to-implement, practical solution for optimising data acquisition in resource-constrained environments.
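As a toy illustration of such clearing, the greedy knapsack-style heuristic below purchases the labels with the best uncertainty-per-cost ratio under a fixed budget (here for the variance-based strategy); the paper's exact clearing optimisation and pricing mechanisms are richer than this sketch.

```python
import numpy as np

def clear_label_market(variances, prices, budget):
    """Greedy relaxation of market clearing: buy labels with the best
    predictive-variance-to-price ratio until the budget is exhausted."""
    order = np.argsort(-variances / prices)   # value per unit cost
    chosen, spent = [], 0.0
    for i in order:
        if spent + prices[i] <= budget:
            chosen.append(int(i))
            spent += prices[i]
    return chosen, spent

rng = np.random.default_rng(1)
var = rng.gamma(2.0, 1.0, size=50)      # model variance at unlabelled points
price = rng.uniform(1.0, 5.0, size=50)  # sellers' asking prices
idx, cost = clear_label_market(var, price, budget=20.0)
print(f"bought {len(idx)} labels for {cost:.2f}")
```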
What's the next frontier for Data-centric AI? Data Savvy Agents
Seedat, Nabeel, Liu, Jiashuo, van der Schaar, Mihaela
The recent surge in AI agents that autonomously communicate, collaborate with humans and use diverse tools has unlocked promising opportunities in various real-world settings. However, a vital aspect remains underexplored: how agents handle data. Scalable autonomy demands agents that continuously acquire, process, and evolve their data. In this paper, we argue that data-savvy capabilities should be a top priority in the design of agentic systems to ensure reliable real-world deployment. Specifically, we propose four key capabilities to realize this vision: (1) Proactive data acquisition: enabling agents to autonomously gather task-critical knowledge or solicit human input to address data gaps; (2) Sophisticated data processing: requiring context-aware and flexible handling of diverse data challenges and inputs; (3) Interactive test data synthesis: shifting from static benchmarks to dynamically generated interactive test data for agent evaluation; and (4) Continual adaptation: empowering agents to iteratively refine their data and background knowledge to adapt to shifting environments. While current agent research predominantly emphasizes reasoning, we hope to inspire a reflection on the role of data-savvy agents as the next frontier in data-centric AI.
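As a toy illustration of the first capability, proactive data acquisition, the sketch below gates an agent's answer on a confidence estimate and acquires more data when the evidence is insufficient. Everything here is illustrative rather than drawn from the paper.

```python
def answer_or_acquire(question, knowledge, confidence_fn, acquire_fn,
                      threshold=0.8, max_rounds=3):
    """Answer only when confident; otherwise proactively fill the data gap."""
    for _ in range(max_rounds):
        conf = confidence_fn(question, knowledge)
        if conf >= threshold:
            return f"answer from {len(knowledge)} facts (conf={conf:.2f})"
        knowledge = knowledge + acquire_fn(question, knowledge)
    return "escalate to human: insufficient data after acquisition budget"

print(answer_or_acquire(
    "toy question", ["fact 1"],
    confidence_fn=lambda q, k: 0.3 * len(k),         # grows with evidence
    acquire_fn=lambda q, k: [f"fact {len(k) + 1}"],  # fetch one more fact
))
```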
Data as a Lever: A Neighbouring Datasets Perspective on Predictive Multiplicity
Ganesh, Prakhar, Hsu, Hsiang, Farnadi, Golnoosh
Multiplicity -- the existence of distinct models with comparable performance -- has received growing attention in recent years. While prior work has largely emphasized modelling choices, the critical role of data in shaping multiplicity has been comparatively overlooked. In this work, we introduce a neighbouring datasets framework to examine the most granular case: the impact of a single-data-point difference on multiplicity. Our analysis yields a seemingly counterintuitive finding: neighbouring datasets with greater inter-class distribution overlap exhibit lower multiplicity. This reversal of conventional expectations arises from a shared Rashomon parameter, and we substantiate it with rigorous proofs. Building on this foundation, we extend our framework to two practical domains: active learning and data imputation. For each, we establish natural extensions of the neighbouring datasets perspective, conduct the first systematic study of multiplicity in existing algorithms, and finally, propose novel multiplicity-aware methods, namely, multiplicity-aware data acquisition strategies for active learning and multiplicity-aware data imputation techniques.
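A cheap empirical proxy for this notion can be computed as below: retrain near-equivalent models on a dataset and on a neighbouring dataset differing in a single point, then measure how often the models disagree on unseen inputs (ambiguity). The paper's Rashomon-set analysis is exact where this bootstrap proxy is heuristic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ambiguity(X, y, X_eval, n_models=20, seed=0):
    """Fraction of eval points on which near-equivalent models disagree.
    Models differ only through bootstrap resampling, a loose stand-in for
    membership in a Rashomon set."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        preds.append(model.predict(X_eval))
    preds = np.array(preds)
    return (preds != preds[0]).any(axis=0).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
X2, y2 = X.copy(), y.copy()
y2[0] = 1 - y2[0]                 # a neighbouring dataset: one flipped label
X_eval = rng.normal(size=(500, 5))
print(ambiguity(X, y, X_eval), ambiguity(X2, y2, X_eval))
```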
Appendix A Data Acquisition
The full procedure of data collection and data preprocessing is described in detail in Ref. [ ]. In addition, a top-down camera recorded the mouse in the arena at 60 Hz. In a grid search over convolutional kernel sizes, kernel size 7 performed best; we repeated the grid search for CNNs with different numbers of convolutional layers. We initially hypothesized that an autoencoder could provide regularization benefits over a "vanilla" CNN, because the reconstruction loss might encourage the model to learn visual features that are useful for decoding. After a hyperparameter search, we settled on size 256 for the latent space vector, and the weight of the reconstruction loss relative to the Poisson loss was fixed at 0.5.
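A sketch of the combined objective described above, assuming a PyTorch implementation: a Poisson negative log-likelihood on the decoded neural activity plus a frame-reconstruction term with relative weight 0.5. The shapes and architecture are placeholders, not the exact model from this appendix.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_rate, spikes, recon, frames, recon_weight=0.5):
    """Poisson decoding loss plus weighted frame-reconstruction loss.

    pred_rate: predicted (positive) firing rates, same shape as spikes.
    recon, frames: autoencoder output and target video frames.
    """
    poisson = F.poisson_nll_loss(pred_rate, spikes, log_input=False)
    reconstruction = F.mse_loss(recon, frames)
    return poisson + recon_weight * reconstruction

rate = torch.rand(8, 50) + 0.1                    # positive predicted rates
spikes = torch.poisson(torch.full((8, 50), 2.0))  # toy spike counts
recon = torch.rand(8, 1, 64, 64)
frames = torch.rand(8, 1, 64, 64)
print(combined_loss(rate, spikes, recon, frames))
```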
Safety Assessment of Scaffolding on Construction Site using AI
Prabhu, Sameer, Patwardhan, Amit, Karim, Ramin
In the construction industry, safety assessment is vital to ensure both the reliability of assets and the safety of workers. Scaffolding, a key structural support asset, requires regular inspection to detect and identify alterations from the design rules that may compromise its integrity and stability. At present, inspections are primarily visual and are conducted by a site manager or accredited personnel to identify deviations. However, visual inspection is time-intensive and susceptible to human error, which can lead to unsafe conditions. This paper explores the use of Artificial Intelligence (AI) and digitization to improve the accuracy of scaffolding inspection and contribute to safety improvement. A cloud-based AI platform is developed to process and analyse point cloud data of the scaffolding structure. The proposed system detects structural modifications by comparing recent point cloud data against certified reference data. This approach may enable automated monitoring of scaffolding, reducing the time and effort required for manual inspections while enhancing safety on the construction site.
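A minimal sketch of the kind of comparison such a platform might perform: flag scanned points that lie farther than a tolerance from the certified reference cloud using nearest-neighbour distances. The platform's actual processing pipeline is not specified in the abstract.

```python
import numpy as np
from scipy.spatial import cKDTree

def flag_deviations(reference, scan, tol=0.05):
    """Return scanned points farther than `tol` (cloud units, e.g. metres)
    from their nearest neighbour in the certified reference cloud."""
    dist, _ = cKDTree(reference).query(scan)
    return scan[dist > tol], dist

ref = np.random.default_rng(0).uniform(0, 2, size=(5000, 3))
scan = ref + np.random.default_rng(1).normal(0, 0.01, size=ref.shape)
scan[:10] += 0.5                      # simulate a displaced scaffold member
moved, _ = flag_deviations(ref, scan)
print(f"{len(moved)} points deviate from the certified reference")
```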
Scaling Up Active Testing to Large Language Models
Berrada, Gabrielle, Kossen, Jannik, Razzak, Muhammed, Smith, Freddie Bickford, Gal, Yarin, Rainforth, Tom
Active testing enables label-efficient evaluation of models through careful data acquisition. However, its significant computational costs have previously undermined its use for large models. We show how it can be successfully scaled up to the evaluation of large language models (LLMs). In particular, we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning, does not require updating within an active-testing loop, and can be smaller than the target model. We even find we can make good data-acquisition decisions without computing predictions with the target model, and we further introduce a single-run error estimator to assess how well active testing is working on the fly. We find that our approach evaluates LLM performance more effectively, and with less data, than current standard practice.
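The acquisition-and-correction loop at the heart of active testing can be sketched as follows: sample test points in proportion to the surrogate's expected loss, then importance-weight the observed losses so the risk estimate remains unbiased. This is a simplified sampling-with-replacement version, not the paper's exact estimator or in-context surrogate construction.

```python
import numpy as np

def active_test_risk(surrogate_loss, true_loss_fn, n_acquire, seed=0):
    """Estimate target-model risk from a few actively acquired labels."""
    rng = np.random.default_rng(seed)
    q = surrogate_loss / surrogate_loss.sum()        # acquisition proposal
    idx = rng.choice(len(q), size=n_acquire, replace=True, p=q)
    losses = np.array([true_loss_fn(i) for i in idx])
    weights = 1.0 / (len(q) * q[idx])                # importance correction
    return (weights * losses).mean()                 # unbiased when q > 0

rng = np.random.default_rng(2)
true_loss = rng.uniform(0, 1, size=10_000)           # per-point target losses
surrogate = np.clip(true_loss + rng.normal(0, 0.1, size=10_000), 1e-3, None)
est = active_test_risk(surrogate, lambda i: true_loss[i], n_acquire=100)
print(f"estimated risk {est:.3f} vs true {true_loss.mean():.3f}")
```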